194 research outputs found

PAC-Bayesian High Dimensional Bipartite Ranking

Author: Guedj Benjamin
Robbiano Sylvain
Publication venue: 'Elsevier BV'
Publication date: 01/01/2018

This paper is devoted to the bipartite ranking problem, a classical statistical learning task, in a high dimensional setting. We propose a scoring and ranking strategy based on the PAC-Bayesian approach. We consider nonlinear additive scoring functions, and we derive non-asymptotic risk bounds under a sparsity assumption. In particular, oracle inequalities in probability holding under a margin condition assess the performance of our procedure, and prove its minimax optimality. We propose an MCMC-flavored algorithm to implement our method and illustrate its behavior on synthetic and real-life datasets.
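As a reading aid, the quantity that bipartite ranking usually seeks to minimise is the pairwise ranking risk: the probability that a scoring function orders two observations with different labels the wrong way round. The sketch below computes its empirical counterpart. The function name and the convention that ties are not counted as errors are assumptions made here for illustration; the abstract does not specify the exact risk used in the paper.

```python
def empirical_ranking_risk(scores, labels):
    """Fraction of ordered pairs (i, j), i != j, that the scores misorder.

    A pair is misordered when the score difference and the label
    difference have opposite signs. Ties (equal scores or equal labels)
    are not counted as errors; this is an illustrative convention.
    """
    n = len(scores)
    if n != len(labels):
        raise ValueError("scores and labels must have the same length")
    if n < 2:
        raise ValueError("at least two observations are required")

    misordered = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if (scores[i] - scores[j]) * (labels[i] - labels[j]) < 0:
                misordered += 1
    return misordered / (n * (n - 1))


# Hypothetical toy data: one positive (+1) and two negatives (-1).
perfect = empirical_ranking_risk([0.9, 0.2, 0.5], [1, -1, -1])
reversed_ = empirical_ranking_risk([0.1, 0.9, 0.5], [1, -1, -1])
```

The double loop is quadratic in the sample size, which is acceptable for a sketch but would be replaced by a sort-based computation on large datasets.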

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

UCL Discovery

An Oracle Inequality for Quasi-Bayesian Non-Negative Matrix Factorization

Author: Alquier Pierre
Guedj Benjamin
Publication venue: 'Allerton Press'
Publication date: 01/01/2017

The aim of this paper is to provide some theoretical understanding of quasi-Bayesian aggregation methods for non-negative matrix factorization. We derive an oracle inequality for an aggregated estimator. This result holds for a very general class of prior distributions and shows how the prior affects the rate of convergence. Comment: This is the corrected version of the published paper P. Alquier, B. Guedj, An Oracle Inequality for Quasi-Bayesian Non-negative Matrix Factorization, Mathematical Methods of Statistics, 2017, vol. 26, no. 1, pp. 55-67. Since then Arnak Dalalyan (ENSAE) found a mistake in the proofs. We fixed the mistake at the price of a slightly different logarithmic term in the bound

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

HAL Descartes

HAL-Polytechnique

Pycobra: A Python Toolbox for Ensemble Learning and Visualisation

Author: Desikan Bhargav Srinivasa
Guedj Benjamin
Publication date: 01/04/2018

We introduce \texttt{pycobra}, a Python library devoted to ensemble learning (regression and classification) and visualisation. Its main assets are the implementation of several ensemble learning algorithms, a flexible and generic interface to compare and blend any existing machine learning algorithm available in Python libraries (as long as a \texttt{predict} method is given), and visualisation tools such as Voronoi tessellations. \texttt{pycobra} is fully \texttt{scikit-learn} compatible and is released under the MIT open-source license. \texttt{pycobra} can be downloaded from the Python Package Index (PyPI) and Machine Learning Open Source Software (MLOSS). The current version (along with Jupyter notebooks, extensive documentation, and continuous integration tests) is available at \href{https://github.com/bhargavvader/pycobra}{https://github.com/bhargavvader/pycobra} and the official documentation website is \href{https://modal.lille.inria.fr/pycobra}{https://modal.lille.inria.fr/pycobra}.

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

UCL Discovery

A Quasi-Bayesian Perspective to Online Clustering

Author: Guedj Benjamin
Li Le
Loustau Sébastien
Publication venue: 'Institute of Mathematical Statistics'
Publication date: 01/01/2018

When faced with high frequency streams of data, clustering raises theoretical and algorithmic pitfalls. We introduce a new and adaptive online clustering algorithm relying on a quasi-Bayesian approach, with a dynamic (i.e., time-dependent) estimation of the (unknown and changing) number of clusters. We prove that our approach is supported by minimax regret bounds. We also provide an RJMCMC-flavored implementation (called PACBO, see https://cran.r-project.org/web/packages/PACBO/index.html) for which we give a convergence guarantee. Finally, numerical experiments illustrate the potential of our procedure.

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

UCL Discovery

Sequential Learning of Principal Curves: Summarizing Data Streams on the Fly

Author: Guedj Benjamin
Li Le
Publication date: 08/05/2019

When confronted with massive data streams, summarizing data with dimension reduction methods such as PCA raises theoretical and algorithmic pitfalls. Principal curves act as a nonlinear generalization of PCA and the present paper proposes a novel algorithm to automatically and sequentially learn principal curves from data streams. We show that our procedure is supported by regret bounds with optimal sublinear remainder terms. A greedy local search implementation (called \texttt{slpc}, for Sequential Learning Principal Curves) that incorporates both sleeping experts and multi-armed bandit ingredients is presented, along with its regret computation and performance on synthetic and real-life data.

arXiv.org e-Print Archive

Wasserstein PAC-Bayes Learning: A Bridge Between Generalisation and Optimisation

Author: Guedj Benjamin
Haddouche Maxime
Publication date: 14/04/2023

PAC-Bayes learning is an established framework to assess the generalisation ability of learning algorithms during the training phase. However, it remains challenging to know whether PAC-Bayes is useful to understand, before training, why the outputs of well-known algorithms generalise well. We answer this question positively by expanding the \emph{Wasserstein PAC-Bayes} framework, briefly introduced in \cite{amit2022ipm}. We provide new generalisation bounds exploiting geometric assumptions on the loss function. Using our framework, we prove, before any training, that the output of an algorithm from \citet{lambert2022variational} has a strong asymptotic generalisation ability. More precisely, we show that it is possible to incorporate optimisation results within a generalisation framework, building a bridge between PAC-Bayes and optimisation algorithms.

arXiv.org e-Print Archive

PAC-Bayes Generalisation Bounds for Heavy-Tailed Losses through Supermartingales

Author: Guedj Benjamin
Haddouche Maxime
Publication date: 24/04/2023

While PAC-Bayes is now an established learning framework for light-tailed losses (\emph{e.g.}, subgaussian or subexponential), its extension to the case of heavy-tailed losses remains largely uncharted and has attracted a growing interest in recent years. We contribute PAC-Bayes generalisation bounds for heavy-tailed losses under the sole assumption of bounded variance of the loss function. Under that assumption, we extend previous results from \citet{kuzborskij2019efron}. Our key technical contribution is exploiting an extension of Markov's inequality for supermartingales. Our proof technique unifies and extends different PAC-Bayesian frameworks by providing bounds for unbounded martingales as well as bounds for batch and online learning with heavy-tailed losses. Comment: New Section 3 on Online PAC-Bayes
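For context, the abstract does not state the inequality it relies on. The classical result of this kind is Ville's maximal inequality, reproduced below on the assumption that it is the extension meant here; the notation $(M_t)$ and $a$ is introduced for illustration only. For a non-negative supermartingale $(M_t)_{t \ge 0}$ with $\mathbb{E}[M_0] < \infty$ and any $a > 0$:

```latex
\mathbb{P}\left( \exists\, t \ge 0 : M_t \ge a \right) \;\le\; \frac{\mathbb{E}[M_0]}{a}.
```

Markov's inequality gives the same bound for a single fixed $t$; the supermartingale structure is what allows the bound to hold uniformly over all times, which is the property that makes such a result useful for online and sequential settings.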

arXiv.org e-Print Archive

Differentiable PAC-Bayes Objectives with Partially Aggregated Neural Networks

Author: Biggs Felix
Guedj Benjamin
Publication venue: 'MDPI AG'
Publication date: 22/06/2020

We make three related contributions motivated by the challenge of training stochastic neural networks, particularly in a PAC-Bayesian setting: (1) we show how averaging over an ensemble of stochastic neural networks enables a new class of \emph{partially-aggregated} estimators; (2) we show that these lead to provably lower-variance gradient estimates for non-differentiable signed-output networks; (3) we reformulate a PAC-Bayesian bound for these networks to derive a directly optimisable, differentiable objective and a generalisation guarantee, without using a surrogate loss or loosening the bound. This bound is twice as tight as that of Letarte et al. (2019) on a similar network type. We show empirically that these innovations make training easier and lead to competitive guarantees.

arXiv.org e-Print Archive

Multidisciplinary Digital Publishing Institute

INRIA a CCSD electronic archive server

Kernel-Based Ensemble Learning in Python

Author: Desikan Bhargav Srinivasa
Guedj Benjamin
Publication venue: 'MDPI AG'
Publication date: 17/12/2019

We propose a new supervised learning algorithm for classification and regression problems where two or more preliminary predictors are available. We introduce \texttt{KernelCobra}, a non-linear learning strategy for combining an arbitrary number of initial predictors. \texttt{KernelCobra} builds on the COBRA algorithm introduced by \citet{biau2016cobra}, which combined estimators based on a notion of proximity of predictions on the training data. While the COBRA algorithm used a binary threshold to declare which training data were close and to be used, we generalize this idea by using a kernel to better encapsulate the proximity information. Such a smoothing kernel provides more representative weights to each of the training points which are used to build the aggregate and final predictor, and \texttt{KernelCobra} systematically outperforms the COBRA algorithm. While COBRA is intended for regression, \texttt{KernelCobra} deals with classification and regression. \texttt{KernelCobra} is included as part of the open source Python package \texttt{Pycobra} (0.2.4 and onward), introduced by \citet{guedj2018pycobra}. Numerical experiments assess the performance (in terms of pure prediction and computational complexity) of \texttt{KernelCobra} on real-life and synthetic datasets. Comment: 11 pages
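The contrast drawn in this abstract, a binary proximity threshold versus smooth kernel weights, can be sketched in a few lines of plain Python. This is not the \texttt{pycobra} API: the function names, the Gaussian-type kernel, the `bandwidth` parameter, and the requirement that all preliminary predictors agree within `eps` are assumptions chosen for illustration.

```python
import math


def threshold_weights(train_preds, query_preds, eps):
    """COBRA-style 0/1 weights: a training point counts only if every
    preliminary predictor's output on it is within eps of that
    predictor's output on the query point (illustrative variant)."""
    return [
        1.0 if all(abs(p - q) <= eps for p, q in zip(row, query_preds)) else 0.0
        for row in train_preds
    ]


def kernel_weights(train_preds, query_preds, bandwidth):
    """Smooth weights that decay with the squared distance between
    prediction vectors (Gaussian-type kernel, an assumption here)."""
    return [
        math.exp(-sum((p - q) ** 2 for p, q in zip(row, query_preds)) / bandwidth)
        for row in train_preds
    ]


def aggregate(weights, targets):
    """Weighted average of training targets; returns 0.0 when no
    training point receives positive weight (illustrative convention)."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * y for w, y in zip(weights, targets)) / total


# Hypothetical data: rows are training points, columns are the outputs
# of two preliminary predictors on those points.
train_preds = [[1.0, 1.0], [1.1, 0.9], [3.0, 3.0]]
targets = [1.0, 2.0, 10.0]
query_preds = [1.0, 1.0]

hard = aggregate(threshold_weights(train_preds, query_preds, eps=0.2), targets)
soft = aggregate(kernel_weights(train_preds, query_preds, bandwidth=1.0), targets)
```

With the threshold rule, the distant third training point is discarded outright; with the kernel rule it keeps a small positive weight, so every training point contributes in proportion to how closely the preliminary predictions agree.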

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

From industry-wide parameters to aircraft-centric on-flight inference: improving aeronautics performance prediction with machine learning

Author: Dewez Florent
Guedj Benjamin
Vandewalle Vincent
Publication venue: 'Cambridge University Press (CUP)'
Publication date: 01/01/2020

Aircraft performance models play a key role in airline operations, especially in planning a fuel-efficient flight. In practice, manufacturers provide guidelines which are slightly modified throughout the aircraft life cycle via the tuning of a single factor, enabling better fuel predictions. However, this approach has limitations: in particular, it does not reflect the evolution of each feature impacting the aircraft performance. Our goal here is to overcome this limitation. The key contribution of the present article is to foster the use of machine learning to leverage the massive amounts of data continuously recorded during flights performed by an aircraft and provide models reflecting its actual and individual performance. We illustrate our approach by focusing on the estimation of the drag and lift coefficients from recorded flight data. As these coefficients are not directly recorded, we resort to aerodynamics approximations. As a safety check, we provide bounds to assess the accuracy of both the aerodynamics approximation and the statistical performance of our approach. We provide numerical results on a collection of machine learning algorithms. We report excellent accuracy on real-life data and exhibit empirical evidence to support our modelling, consistent with aerodynamics principles. Comment: Published in Data-Centric Engineering

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

HAL Descartes

UCL Discovery